Search Results: "gfa"

30 September 2015

Chris Lamb: Free software activities in September 2015

Inspired by Raphaël Hertzog, here is a monthly update covering a large part of what I have been doing in the free software world:
Debian

The Reproducible Builds project was also covered in depth on LWN as well as in Lunar's weekly reports (#18, #19, #20, #21, #22).
Uploads
  • redis A new upstream release, as well as overhauling the systemd configuration, maintaining feature parity with sysvinit and adding various security hardening features.
  • python-redis Attempting to get its Debian Continuous Integration tests to pass successfully.
  • libfiu Ensuring we do not FTBFS under exotic locales.
  • gunicorn Dropping a dependency on python-tox now that tests are disabled.



RC bugs


I also filed FTBFS bugs against actdiag, actdiag, bangarang, bmon, bppphyview, cervisia, choqok, cinnamon-control-center, clasp, composer, cpl-plugin-naco, dirspec, django-countries, dmapi, dolphin-plugins, dulwich, elki, eqonomize, eztrace, fontmatrix, freedink, galera-3, golang-git2go, golang-github-golang-leveldb, gopher, gst-plugins-bad0.10, jbofihe, k3b, kalgebra, kbibtex, kde-baseapps, kde-dev-utils, kdesdk-kioslaves, kdesvn, kdevelop-php-docs, kdewebdev, kftpgrabber, kile, kmess, kmix, kmldonkey, knights, konsole4, kpartsplugin, kplayer, kraft, krecipes, krusader, ktp-auth-handler, ktp-common-internals, ktp-text-ui, libdevice-cdio-perl, libdr-tarantool-perl, libevent-rpc-perl, libmime-util-java, libmoosex-app-cmd-perl, libmoosex-app-cmd-perl, librdkafka, libxml-easyobj-perl, maven-dependency-plugin, mmtk, murano-dashboard, node-expat, node-iconv, node-raw-body, node-srs, node-websocket, ocaml-estring, ocaml-estring, oce, odb, oslo-config, oslo.messaging, ovirt-guest-agent, packagesearch, php-svn, php5-midgard2, phpunit-story, pike8.0, plasma-widget-adjustableclock, plowshare4, procps, pygpgme, pylibmc, pyroma, python-admesh, python-bleach, python-dmidecode, python-libdiscid, python-mne, python-mne, python-nmap, python-nmap, python-oslo.middleware, python-riemann-client, python-traceback2, qdjango, qsapecng, ruby-em-synchrony, ruby-ffi-rzmq, ruby-nokogiri, ruby-opengraph-parser, ruby-thread-safe, shortuuid, skrooge, smb4k, snp-sites, soprano, stopmotion, subtitlecomposer, svgpart, thin-provisioning-tools, umbrello, validator.js, vdr-plugin-prefermenu, vdr-plugin-vnsiserver, vdr-plugin-weather, webkitkde, xbmc-pvr-addons, xfsdump & zanshin.

19 August 2015

Russ Allbery: rra-c-util 5.8

Another collection of random bug fixes in my general C utility library. Fixes a missing va_end, a segfault in buffer_find_string with new buffers, and how relative paths are computed in Test::RRA::Automake. Also adds a portability layer for the MIT Kerberos kadm5_init_krb5_context function. You can get the latest release from the rra-c-util distribution page.

Russ Allbery: C TAP Harness 3.4

Nothing particularly exciting in this release of my C testing framework, but aherbert on Github found a segfault in the runtests driver with test lists that had only blank lines and comments. Since I'm releasing some other software anyway, that seemed to be worth a release. While preparing the release, I found that the test for spelling errors in the POD documentation had some bad assumptions about how to canonicalize paths, so you may want to grab a new copy of that if you're using it in your projects. You can get the latest release from the C TAP Harness distribution page.

17 July 2015

Dirk Eddelbuettel: RcppTOML 0.0.4

We introduced RcppTOML in April with the initial CRAN release 0.0.3. It permits R to read the (absolutely awesome) TOML format, which is simply fabulous for configuration files. A new version appeared on CRAN yesterday. We had observed a somewhat rare segfault in our production use, which came down to me dereferencing a list iterator without checking its length first. Ooops. As usual, a few other changes were made as well, mostly to stay on the good side of R CMD check --as-cran for the development version of R. Courtesy of CRANberries, there is also a diffstat report for this release. More information is on the RcppTOML page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

21 April 2015

Manuel A. Fernandez Montecelo: About the Debian GNU/Linux port for OpenRISC or1k

In my previous post I mentioned my involvement with the OpenRISC or1k port. It was the technical activity on which I spent the most time during 2014 (Debian and otherwise, day job aside). I thought that it would be nice to talk a bit about the port for people who don't know about it, and give an update for those who do know and care. So this post explains a bit how it came to be, gives details about its development, and finally the current status. It is going to be written as a rather personal account, for that matter, since I did not get involved enough in the OpenRISC community at large to learn much about its internal workings and the aspects that I was not directly involved with. There is not much information about all of this elsewhere, only bits and pieces scattered here and there, and especially not much public information at all about the development of the Debian port. There is an OpenRISC entry in the Debian wiki, but it does not contain much information yet. Hopefully, this piece will help a bit to preserve history and give an insight for future porters.

First Things First

I imagine that most people reading this post will be familiar with the terminology, but just in case: to create a new Debian port means to get a Debian system (the GNU/Linux variant, in this case) to run on the OpenRISC or1k computer architecture. Setting to one side all differences between hardware and software, and as described on their site:
The aim of the OpenRISC project is to create free and open source computing platforms
It is therefore a good match for the purposes of Debian and the Free Software world in general. The processor has not been produced in silicon, or at least is not available to the masses. People with the necessary know-how can download the hardware description (Verilog) and synthesise it in an FPGA, or otherwise use simulators. It is not some piece of hardware that people can purchase yet, and there are no plans to mass-produce it in the near future either. The two people (including me) involved in this Debian port did not have the hardware, so we created the port entirely through cross-compiling from other architectures, and then compiling inside Qemu. In a sense, we were creating a Debian port for hardware that "does not [physically] exist". The software that we built was tested by people who had the hardware available in an FPGA, though, so it was at least usable. I understand that people working on the arm64 port had to work similarly in the initial phases, working in the dark without access to real hardware to compile or test.

The Spark

The first time I heard about the initiative to create the port was in late February of 2014, in a post which appeared in Linux Weekly News (sent by Paul Wise) and on Slashdot. The original post announcing it was actually from late January, from Christian Svensson (blueCmd):
Some people know that I've been working on porting Glibc and doing some toolchain work. My evil master plan was to make a Debian port, and today I'm a happy hacker indeed! Below is a link to a screencast of me installing Debian for OpenRISC, installing python2.7 via apt-get (which you shouldn't do in or1ksim, it takes ages! (but it works!)) and running a small Python script. http://asciinema.org/a/7362
So, now, what can a Debian Hacker do when reading this? (Even if one's Hackery Level is not that high, as in my case.) And well, How Hard Can It Be? I mean, Really? Well, in my own defence, I knew that the answer to the last two questions would be a resounding Very. But for some reason the idea grabbed me and I couldn't help but think that it would be a Really Exciting Project, and that somehow I would like to get involved. So I wrote to Christian offering my help after considering it for a few days, around mid March, and he welcomed me aboard.

The Ball Was Already Rolling

Christian had already been in contact with the people behind DebianBootstrap, and he had already created the repository http://openrisc.debian.net/ with many packages of the base system and beyond (read: packages name_version_or1k.deb available to download and install). Even now the packages are not signed with proper keys, though, so use your judgement if you want to try them. After a few weeks, I got up to speed with the status of the project and got my system working with the necessary tools. This meant basically sbuild/schroot to compile new packages, with the base system that Christian had already got working installed in a chroot (probably with the help of debootstrap), and qemu-system-or1k to simulate the system. Only a few of the packages were different from the versions in Debian, like gcc, binutils or glibc -- they had not been upstreamed yet. sbuild ran through qemu-system-or1k, so the compilation of new packages could happen "natively" (running inside Qemu) rather than by cross-compiling, pulling _or1k.deb packages for dependencies from the repository that he had prepared, and _all.deb packages from snapshots.debian.org. I started by trying to get the packages that I [co-]maintain in Debian compiled for this architecture, creating the corresponding _or1k.deb files. For most of them, though, I needed many dependencies compiled before I could even compile my packages.

The GNU autotools / autoreconf Problem

Since very early on, many of the packages failed to build with messages such as:
Invalid configuration 'or1k-linux-gnu': machine 'or1k' not recognized
configure: error: /bin/bash ../config.sub or1k-linux-gnu failed
This means that software packages based on GNU autotools and using configure scripts need recent versions of the files config.sub and config.guess, which they ship in their root directory, to be able to detect the architecture and generate the code accordingly. This is counter-intuitive, taking into account that GNU autotools were designed to help with portability. Yet, in the case of creating new Debian ports, it meant that unless upstream shipped very recent versions of config.{guess,sub}, the package would not compile straight away on the new architectures -- even though invoking gcc without further ado would have worked without problems in most cases for native compilation. Of course this did not only affect or1k, and there was already the autoreconf effort underway as a way to update these files automatically when building Debian packages, pushed by the people porting Debian to the new architectures added in 2013/2014 (mips64el, arm64, ppc64el), who had encountered the same roadblock. This affected around a thousand source packages in unstable. A Royal Pain. Also, all of their reverse dependencies (packages that depended on those to be built) could not be compiled straight away. The bugs were not Release Critical, though (none of these architectures were officially accepted at the time), so for people not concerned with the new ports there was no big incentive to get them fixed. This problem, which conceptually is easily solvable, prevented new ports from even attempting to compile vast portions of the archive straight away (cleanly, without modifications to the package or to the host system) for weeks or months.

The GNU autotools / autoreconf Solution

We tackled this problem mainly in two ways. The first, more useful for Debian in general, was to do as other porters were doing and submit bug reports and patches to Debian packages requesting them to use autoreconf, and to NMU packages (upload changes to the archive without the official maintainers' intervention). A few NMUs were made for packages which had had bug reports with patches available for a while, which were in the critical path to get many other packages compiled, and which were orphaned or had almost no maintainer activity. The people working on the other new ports, and mainly Ubuntu people who helped with some of those ports and wanted to support them, had submitted a large number of requests since late 2013, so there was no shortage of NMUs to be made. Some porters, not being Debian Developers, could not easily get the changes applied; so I also helped a bit the porters of other architectures, especially later on before the freeze of Jessie, to get as many packages compiled on those architectures as possible. The second way was to create dpkg-buildpackage hooks that unconditionally updated config.{guess,sub} before attempting to build the package on the local build system (a minimal sketch of such a hook follows at the end of this section). This local and temporary solution allowed us in the or1k port to get many _or1k.deb packages into the experimental repository, which in turn would allow many more packages to compile. After a few weeks, I set up several sbuilds on a multi-core machine, attempting to build uninterruptedly packages that had not previously been built and which had their dependencies available. Every now and then (typically several times per day at peak times) I pushed the resulting _or1k.deb files to the repository, so more packages would have the necessary dependencies ready when attempting to build.
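Such a hook boils down to roughly the following (a sketch, not the exact script used; the up-to-date files come from Debian's autotools-dev package):

#!/bin/sh
# Local, temporary hack: before building, overwrite whatever config.guess and
# config.sub the source tree ships with the current copies from autotools-dev,
# so that the new or1k triplet is recognised.
set -e
for f in config.guess config.sub; do
    find . -name "$f" -exec cp -v "/usr/share/misc/$f" '{}' \;
done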
Christian was doing something similar, and by April, at peak times, between the two of us we were compiling more than a hundred packages on some days -- a huge number of packages did not need any change other than up-to-date config.{guess,sub} files. At some point in late April, Christian set up wanna-build on a few hosts to do this more elegantly and smartly than my method, and more effectively as well.

Ugly Hacks, Bugs and Shortcomings in the Toolchain and Qemu

Some packages are extremely important because many other packages need them to compile (like cmake, Qt or GTK+), and they are themselves very complex and have dependency loops. They had deeper problems than the autoreconf issue and needed some seriously dirty hacking to get built. To try to get as many packages compiled as possible, we sometimes compiled these important packages with some functionality disabled, disabling some binary packages (e.g. Java bindings) or especially disabling documentation (using DEB_BUILD_OPTIONS=nodoc when possible, and more aggressively when needed by removing chunks of debian/rules). I tried to use the more aggressive methods in as few packages as possible, though -- about a dozen in total. We also used DEB_BUILD_OPTIONS=nocheck to speed up compilation and avoid build failures -- many packages' tests failed due to qemu-system-or1k not supporting multi-threading, which we could do nothing about at the time, but otherwise the packages mostly passed their tests fine. Due to bugs and shortcomings in Qemu and the toolchain -- like the compiler lacking atomics, missing functionality in glibc, Qemu entering endless loops, or programs segfaulting (especially gettext, used by many packages and causing them to fail to build) -- we had to resort to some very creative ways, or time-consuming dull work, to edit debian/rules or to create wrappers around the real programs avoiding or forcing certain options (like gcc -O0, since -O2 made buggy binaries too often; sketched below). To avoid having a mix of cleanly compiled and hacked packages in the same repository, Christian set up a two-tiered repository system -- the clean one and the dirty one. Into the dirty one we dumped all of the packages that we got built, no matter how. The packages in the clean one could use packages from the dirty one to build, but they themselves were compiled without any hackery. Of course this was not a completely airtight solution, since they could contain code injected at build time from the "dirty repository" (e.g. by static linking), and perhaps other quirks. We hoped to get rid of these problems later by rebuilding all packages against clean builds of all their dependencies. In addition, Christian also spent significant amounts of time working inside the OpenRISC community, debugging problems, testing and recompiling special versions of the toolchain that we could use to advance in our compilation of the whole archive. There were other people in the OpenRISC community implementing the necessary bits in the toolchain, but I don't know the details.

Good Progress

Olof Kindgren wrote the OpenRISC health report April 2014 (actually posted in May), explaining the status at the time of projects in the broad OpenRISC community, and talking about the software side, Debian port included. Sadly, I think that there have been no more "health reports" since then. There was also a new post on Slashdot entitled OpenRISC Gains Atomic Operations and Multicore Support shortly thereafter.
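The wrapper hack mentioned above conceptually only takes a few lines (again a sketch, not the exact wrapper used); it relies on gcc honouring the last -O option given:

#!/bin/sh
# Wrapper installed in place of gcc, with the real compiler moved aside to
# gcc.real: appending -O0 overrides any -O2/-O3 the build system passes.
exec /usr/bin/gcc.real "$@" -O0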
On the Debian side of the port, from time to time new versions of packages entered unstable and we started to use those newer versions. Some of them had nice fixes, like the autoreconf updates, so they did not require local modifications. In other cases, the new versions failed to build where the old ones had worked (e.g. because the newer versions added support for, and dependencies on, new versions of gnutls, systemd or other packages not yet available for or1k), and we had to repeat or create more nasty hacks to get the packages built again. But in general, progress was very good. There were about 10k arch-dependent packages in Debian at the time, and we got about half of them compiled by the beginning of May, counting clean and dirty. And, if I recall correctly, there were around the same number of arch=all packages (which can be installed on any architecture, once the package has been built on one of them). Counting both, it meant that systems using or1k got about 15k packages available, or 75% of the whole Debian archive (at least "main"; we excluded "contrib" and "non-free"). Not bad. By the middle to end of May, we had about 6k arch-dependent packages compiled, and 4k to go. The count of packages eventually reached ~6.6k at its peak (I think in June/July). Many had been built with hacks and not rebuilt cleanly yet, but everything was fine until the number of packages built plateaued.

Plateauing

There were multiple reasons for that. One of them is that, after we had fixed the autoreconf issue locally in some packages, new versions were uploaded to Debian without that problem being fixed (in many cases there was no bug report or patch yet, so it was understandable; in other cases the requests were ignored). The wanna-build for the clean repository set up by Christian rightly considered the package out-of-date and prepared to build the more recent version, which failed. Then, other packages entering the unstable archive and build-depending on newer versions of those could not be built ("BD-Uninstallable") until we built the newer versions of the dependencies in the dirty repository with local hacks. Consequently, the count of cleanly built packages went back and forth, when not backwards. More challenging was the fact that, when creating a new port, language compilers which are written in that same language need to be built for that architecture first. Sometimes it is not the compiler itself, but the compile-time or run-time support for modules of a language that has not been ported yet. Obviously, as with other dependencies, large amounts of packages written in those languages are bound to remain uncompiled for a long time. As Colin Watson explained when porting Haskell's GHC to arm64 and ppc64el, untangling some of the chicken-and-egg problems of language compilers for new ports is extremely challenging. Perl and Python are pretty much a prerequisite of the base Debian system, and Christian got them working early on. But, for example, in May, 247 packages depended on r-base-dev (GNU R) for building, and 736 on ghc, and we did not have those dependencies compiled. Just counting those two, 1k of the remaining 4k to 5k source packages to be compiled for the new architecture would have to wait for a long time. Then there was Java, Mono, etc. Even more worrying were the pending issues with the toolchain, like atomics in glibc, or make check failing for some packages in the clean repository built with wanna-build.
Christian continued to work on the toolchain and to liaise with the rest of the OpenRISC community; I continued to request more changes to the Debian archive through a few requests to use autoreconf, and by pushing a few more NMUs. Though many requests were attended to, I soon got negative replies/reactions and backed off a bit. In the meantime, the porters of the other new architectures at the time were mostly submitting requests to support them and not NMUing much either.

Upstreaming

Things continued more or less in the same state until the end of the summer. The new ports needed as many packages built as possible before the evaluation of which official ports to accept (in early September, I think; the final decision came around the time of the freeze). Porters of the other new architectures (and maintainers, and other helpful Debian Developers) were by then more active in pushing for changes, especially the remaining autoreconf issues, many of which benefited or1k. As I said before, I also kept pushing NMUs now and then, especially during the summer, for packages which were not of immediate benefit to our port but helped the others (e.g. ppc64el needed updates to libtool's ltmain.sh which were not necessary for or1k, in addition to config.{guess,sub}). In parallel, in the or1k camp, there were patches that needed to be sent upstream, like for example Python's NumPy, which I submitted in May to the Debian package and upstream, and which was uploaded to Debian in September with a new upstream release. Similar paths were followed between May and September for packages such as jemalloc, ocaml, gstreamer0.10, libgc, mesa, X.org's cf module and cmake (patch created by Christian). In April, Christian had reached the amazing milestone of tracking down all of the contributors to the port of GNU binutils and getting them to assign copyright to the Free Software Foundation (FSF); all of the work was refreshed and upstreamed. In July or August, he started to gather information about the contributors of the GCC port, which had started more than a decade ago. After that, nothing much happened (from the outside) until the end of the year, when Christian sent a message to the OpenRISC community about the status of upstreaming GCC. There was only one remaining person who had to assign copyright to the FSF, but it was a blocker. In addition, there was the need to find one or more maintainers to liaise with upstream, review the patches, fix the remaining failures in the test suite and keep the port in good shape. A few months after that, and from what I can gather, the status remains the same.

Current Status, and The Future?

In terms of the Debian port, there have not been huge visible changes since the end of the summer, and not only because of the Jessie freeze. It seems that for this effort to keep going forward and be sustainable, sorting out the issues with GCC and glibc is essential. That means having the toolchain completely pushed upstream and in good shape, and in particular completing the copyright assignment. Debian will not accept private forks of those essential packages without a very good reason, even in unofficially supported ports; and from the point of view of porters, working on the remaining not-yet-built packages with continuing problems in the toolchain is very frustrating and time-consuming. Other than that, there is already a significant amount of software available that could run on an or1k system, so I think that overall the project has achieved a significant amount of success.
Granted, KDE and LibreOffice are not available yet, and neither are the tools depending on Haskell or Java. But a lot of software is available (including things high in the stack, like XFCE), and in many respects it should provide a much more functional system than the ones available on Linux (or other free software) systems in the late 1990s. If the blocking issues are sorted out in the near future, the effort needed to get a very functional port, on par with the other unofficial Debian ports, should not be that big. In my opinion, and looking at the big picture, that is not bad at all for an architecture whose hardware implementation is not easy to come by, and for which the port was created almost solely with simulators. That it has been possible to get this far with such meagre resources is an amazing feat of Free Software, and of Debian in particular. As for the future, time will tell, as usual. I will try to keep you posted if there is any significant change.

27 March 2015

Michal Čihař: Porting python-gammu to Python 3

Over time I started to get more and more requests to have python-gammu working with Python 3. Of course the request makes sense, but I somehow failed to find time for it. Also, for quite some time python-gammu has been distributed together with the Gammu sources. This was another hurdle to overcome when supporting Python 3, as in many cases users will want to build the module for both Python 2 and 3 (at least most distributions will want to do so), and with the current CMake-based build system this did not seem to be easy to achieve. So I've decided it's time to split the Python module out of the library. The reasons for keeping them together are no longer valid (libGammu has had quite a stable API for a while now), and having a standard module which can be installed by pip is a nice thing. Once the code had been put into a separate git module, I slowly progressed on porting it to Python 3. Most of the problems were on the C side of the code, where Python really does not make it easy to support both Python 2 and 3. So the code ended up with many #ifdefs, but I see no other way. While doing these changes, many points in the API were fixed to accept unicode strings in Python 2 as well. Anyway, today we have the first successful build of python-gammu working on both Python 2 and 3. I'm afraid there is still some bug leading to occasional segfaults on Travis, which are not reproducible locally. But hopefully this will be fixed in the upcoming weeks and we can release the separate python-gammu module again.
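Once split out, building the module becomes the standard Python routine; roughly (a sketch, with interpreter versions as examples only):

# build the extension once per interpreter from the standalone source tree
python2 setup.py build
python3 setup.py build
# or install straight from PyPI
pip install python-gammu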


23 February 2015

Enrico Zini: akonadi-build-hth

The wonders of missing documentation

Update: I have managed to build an example Akonadi client application. I'm new here, and I want to make a simple C++ GUI app that pops up a QCalendarWidget showing which days my local Akonadi has appointments on. I open qtcreator, create a new app, hack away for a while, then of course I get undefined references for all Akonadi symbols, since I didn't tell the build system that I'm building with Akonadi. Ok. How do I tell the build system that I'm building with Akonadi? After 20 minutes of frantic looking around the internet, I still have no idea. There is a package called libakonadi-dev which does not seem to have anything to do with this. That page mentions everything about making applications with Akonadi except how to build them. There is a package called kdepimlibs5-dev which looks promising: it has no .a files, but it does have headers and cmake files. However, qtcreator is only integrated with qmake, and I would really like the handholding of an IDE at this stage. I put something together naively, doing just what looked right, and I managed to get an application that segfaults before main() is even called:
/*
 * Copyright © 2015 Enrico Zini <enrico@enricozini.org>
 *
 * This work is free. You can redistribute it and/or modify it under the
 * terms of the Do What The Fuck You Want To Public License, Version 2,
 * as published by Sam Hocevar. See the COPYING file for more details.
 */
#include <QDebug>
int main(int argc, char *argv[])
{
    qDebug() << "BEGIN";
    return 0;
}
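# wtf.pro (qmake project file; separate from the wtf.cpp source above)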
QT       += core gui widgets
CONFIG += c++11
TARGET = wtf
TEMPLATE = app
LIBS += -lkdecore -lakonadi-kde
SOURCES += wtf.cpp
I didn't achieve what I wanted, but I feel like I achieved something magical and beautiful after all. I shall now perform some haruspicy on those obscure cmake files to see if I can figure something out. But seriously, people?

30 November 2014

Gregor Herrmann: RC bugs 2014/47-48

these are the RC bugs I've worked on during the last two weeks:

28 November 2014

Richard Hartmann: Release Critical Bug report for Week 48

Holy bug-count-drop, Batman! Some bugs which need loving: The UDD bugs interface currently knows about the following release critical bugs: How do we compare to the Squeeze release cycle?
Week Squeeze Wheezy Jessie
43 284 (213+71) 468 (332+136) 319 (240+79)
44 261 (201+60) 408 (265+143) 274 (224+50)
45 261 (205+56) 425 (291+134) 295 (229+66)
46 271 (200+71) 401 (258+143) 427 (313+114)
47 283 (209+74) 366 (221+145) 342 (260+82)
48 256 (177+79) 378 (230+148) 274 (189+85)
49 256 (180+76) 360 (216+155)
50 204 (148+56) 339 (195+144)
51 178 (124+54) 323 (190+133)
52 115 (78+37) 289 (190+99)
1 93 (60+33) 287 (171+116)
2 82 (46+36) 271 (162+109)
3 25 (15+10) 249 (165+84)
4 14 (8+6) 244 (176+68)
5 2 (0+2) 224 (132+92)
6 release! 212 (129+83)
7 release+1 194 (128+66)
8 release+2 206 (144+62)
9 release+3 174 (105+69)
10 release+4 120 (72+48)
11 release+5 115 (74+41)
12 release+6 93 (47+46)
13 release+7 50 (24+26)
14 release+8 51 (32+19)
15 release+9 39 (32+7)
16 release+10 20 (12+8)
17 release+11 24 (19+5)
18 release+12 2 (2+0)
Graphical overview of bug stats thanks to azhag:

23 October 2014

Erich Schubert: Clustering 23 mio Tweet locations

To test scalability of ELKI, I've clustered 23 million Tweet locations from the Twitter Statuses Sample API, obtained over 8.5 months (due to licensing restrictions by Twitter, I cannot make this data available to you, sorry).
23 million points is a challenge for advanced algorithms. It's quite feasible by k-means; in particular if you choose a small k and limit the number of iterations. But k-means does not make a whole lot of sense on this data set - it is a forced quantization algorithm, but does not discover actual hotspots.
Density-based clustering such as DBSCAN and OPTICS are much more appropriate. DBSCAN is a bit tricky to parameterize - you need to find the right combination of radius and density for the whole world. Given that Twitter adoption and usage is quite different it is very likely that you won't find a single parameter that is appropriate everywhere.
OPTICS is much nicer here. We only need to specify a minimum object count - I chose 1000, as this is a fairly large data set. For performance reasons (and this is where ELKI really shines) I chose a bulk-loaded R*-tree index for acceleration. To benefit from the index, the epsilon radius of OPTICS was set to 5000m. Also, ELKI allows using geodetic distance, so I can specify this value in meters and do not get much artifacts from coordinate projection.
To extract clusters from OPTICS, I used the Xi method, with xi set to 0.01 - a rather low value, also due to the fact of having a large data set.
The results are pretty neat - here is a screenshot (using KDE Marble and OpenStreetMap data, since Google Earth segfaults for me right now):
Screenshot of Clusters in central Europe
Some observations: unsurprisingly, many cities turn up as clusters. Regional differences are also apparent, as seen in the screenshot: plenty of Twitter clusters in England, and a low acceptance rate in Germany (Germans do seem to have objections to using Twitter; maybe they still prefer texting, which was quite big in Germany -- France and Spain use Twitter a lot more than Germany does).
Spam - some of the high usage in Turkey and Indonesia may be due to spammers using a lot of bots there. There also is a spam cluster in the ocean south of Lagos - some spammer uses random coordinates [0;1]; there are 36000 tweets there, so this is a valid cluster...
A benefit of OPTICS and DBSCAN is that they do not cluster every object - low-density areas are considered noise. Also, they support clusters of different shapes (which may be lost in this visualisation, which uses convex hulls!) and different sizes. OPTICS can also produce a hierarchical result.
Note that for these experiments, the actual Tweet text was not used. This has a rough correspondence to Twitter popularity "heatmaps", except that the clustering algorithms will actually provide a formalized data representation of activity hotspots, not only a visualization.
You can also explore the clustering result in your browser - the Google Drive visualization functionality seems to work much better than Google Earth.
If you go to Istanbul or Los Angeles, you will see some artifacts - odd shaped clusters with a clearly visible spike. This is caused by the Xi extraction of clusters, which is far from perfect. At the end of a valley in the OPTICS plot, it is hard to decide whether a point should be included or not. These errors are usually the last element in such a valley, and should be removed via postprocessing. But our OpticsXi implementation is meant to be as close as possible to the published method, so we do not intend to "fix" this.
Certain areas - such as Washington, DC, New York City, and Silicon Valley - do not show up as clusters. The reason is probably again the Xi extraction - these regions do not exhibit the steep density increase expected by Xi, but are too blurred into their surroundings to form a cluster.
Hierarchical results can be found e.g. in Brasilia and Los Angeles.
Compare the OPTICS results above to k-means results (below) - see why I consider k-means results to be a meaningless quantization?
k-means clusters
Sure, k-means is fast (30 iterations; not converged yet. Took 138 minutes on a single core, with k=1000. The parallel k-means implementation in ELKI took 38 minutes on a single node, Hadoop/Mahout on 8 nodes took 131 minutes, as slow as a single CPU core!). But you can see how sensitive it is to misplaced coordinates (outliers, but mostly spam), how many "clusters" are somewhere in the ocean, and that there is no resolution on the cities? The UK is covered by 4 clusters, with little meaning; and three of these clusters stretch all the way into Bretagne - k-means clusters clearly aren't of high quality here.
If you want to reproduce these results, you need to get the upcoming ELKI version (0.6.5~201410xx - the output of cluster convex hulls was just recently added to the default codebase), and of course data. The settings I used are:
-dbc.in coords.tsv.gz
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-pagefile.pagesize 500
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-time
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.01
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 1000
-resulthandler KMLOutputHandler -out /tmp/out.kmz
and the total runtime for 23 million points on a single core was about 29 hours. The indexes helped a lot: less than 10000 distances were computed per point, instead of 23 million - the expected speedup over a non-indexed approach is 2400.
Don't try this with R or Matlab. Your average R clustering algorithm will try to build a full distance matrix, and you probably don't have an exabyte of memory to store this matrix. Maybe start with a smaller data set first, then see how long you can afford to increase the data size.
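For reference, the settings listed above go straight to ELKI's command-line launcher; assuming the 0.6.x series, the invocation looks roughly like this (the entry-point class name is from memory, so double-check it against your ELKI jar):

java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication \
  -dbc.in coords.tsv.gz \
  -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
  -pagefile.pagesize 500 -spatial.bulkstrategy SortTileRecursiveBulkSplit \
  -algorithm clustering.optics.OPTICSXi -opticsxi.xi 0.01 \
  -algorithm.distancefunction geo.LngLatDistanceFunction \
  -optics.epsilon 5000.0 -optics.minpts 1000 \
  -resulthandler KMLOutputHandler -out /tmp/out.kmz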

12 October 2014

Mario Lang: soundCLI works again

I recently ranted about my frustration with GStreamer in a SoundCloud command-line client written in Ruby. Well, it turns out that there was quite a bit of confusion going on. I still haven't figured out why my initial tries resulted in an error regarding $DISPLAY not being set. But now that I have played a bit with gst-launch-1.0, I can positively confirm that this was very likely not the fault of GStreamer. The actual issue is that ruby-gstreamer assumes gstreamer-1.0, while soundCLI was still written against the gstreamer-0.10 API. Since the ruby gst module doesn't have the GStreamer API version in its name, and since Ruby is a dynamic language that only detects most errors at runtime, this led to all sorts of cascaded errors. It turns out I only had to correct the use of query_position, query_duration, and get_state, as well as switching from playbin2 to playbin. soundCLI is now running in the background and playing my SoundCloud stream. A pull request against soundCLI has also been opened. On a somewhat related note, I found a GCC bug (ICE SIGSEGV) this weekend. My first one. It is related to C++11 bracketed initializers. Given that I have heard GCC 5.0 aims to remove the experimental nature of C++11 (and maybe also C++14), this seems like a good time to hit this one. I guess that means I should finally isolate the C++ regex (runtime) segfault I recently stumbled across.

8 October 2014

EvolvisForge blog: PSA: #shellshock still unfixed except in Debian unstable

I just installed, for work, Hanno Böck's bashcheck utility on our monitoring system, and watched all systems go blue. All but two. One is not executing remote scripts from the monitoring for security reasons; the other is my desktop, which runs Debian sid (unstable). (Update, 2014-10-20: jessie, precise, trusty are also green now.) This means that all those distributions still have unfixed #shellshock bugs. I don't know if/when all distributions will have patched their packages, but thought you'd want to know the hysteria isn't over yet -- however, I hope you were not stupid enough to follow the advice of this site, which suggests you download some random file over the net and execute it with superuser permissions, unchecked. (I think the Ruby people were the first to spread this extremely insecure, stupid and reprehensible technique.) Updates: Thanks to tarent for letting me do this work during $dayjob time!
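If you just want to check a single machine by hand, the original CVE-2014-6271 test (one of the checks that bashcheck automates) is a one-liner; a patched bash prints only the echo output, while a vulnerable one also prints "vulnerable":

env x='() { :;}; echo vulnerable' bash -c "echo this is a test"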

1 October 2014

Keith Packard: chromium-dri3

Chromium (the browser) and DRI3

I got a note on IRC a week ago that Chromium was crashing with DRI3. The Google team working on Chromium eventually sent me a link to the bug report. That's secret Google stuff, so you won't be able to follow the link, even though it's a bug in a free software application when running on free software drivers. There's a bug report in the freedesktop bugzilla which looks the same to me. In both cases, the recommended fix was to switch from DRI3 back to DRI2. That's not exactly a great plan, given that DRI3 offers better security between GPU-using applications, which seems like a pretty nice thing to have when you're running random GL applications from the web.

Chromium Sandboxing

I'm not entirely sure how it works, but Chromium creates a process separate from the main browser engine to talk to the GPU. That process has very limited access to the operating system via some fancy library adventures. Presumably, the hope is that security bugs in the GL driver would be harder to leverage into a remote system exploit. Debugging in this environment is a bit tricky as you can't simply run chromium under gdb and expect to be able to set breakpoints in the GL driver. Instead, you have to run chromium with a magic flag which causes the GPU process to pause before loading the driver so you can connect to it with gdb and debug from there, along with a flag that lets you see crashes within the gpu process and the usual flag that causes chromium to ignore the GPU black list which seems to always include the Intel driver for one reason or another:
$ chromium --gpu-startup-dialog --disable-gpu-watchdog --ignore-gpu-blacklist
Once Chromium starts up, it will print out a message telling you to attach gdb to the GPU process and send that process a SIGUSR1 to continue it. Now you can happily debug and get a stack trace when the crash occurs.

Locating the Bug

The bug manifested with a segfault at the first access to a DRI3-allocated buffer within the application. We've seen this problem in the past; whenever buffer allocation fails for some reason, the driver ignores the problem and attempts to de-reference through the (NULL) buffer pointer, causing a segfault. In this case, Chromium called glClear, which tried (and failed) to allocate a back buffer causing the i965 driver to subsequently segfault. We should probably go fix the i965 driver to not segfault when buffer allocation fails, but that wouldn't provide a lot of additional information. What I have done is add some error messages in the DRI3 buffer allocation path which at least tell you why the buffer allocation failed. That patch has been merged to Mesa master, and should also get merged to the Mesa stable branch for the next stable release. Once I had added the error messages, it was pretty easy to see what happened:
$ chromium --ignore-gpu-blacklist
[10618:10643:0930/200525:ERROR:nss_util.cc(856)] After loading Root Certs, loaded==false: NSS error code: -8018
libGL: pci id for fd 12: 8086:0a16, driver i965
libGL: OpenDriver: trying /local-miki/src/mesa/mesa/lib/i965_dri.so
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL: Can't open configuration file /home/keithp/.drirc: Operation not permitted.
libGL error: DRI3 Fence object allocation failure Operation not permitted
The first two errors were just the sandbox preventing Mesa from using my GL configuration file. I'm not sure how that's a security problem, but it shouldn't harm the driver much. The last error is where the problem lies. In Mesa, the DRI3 implementation uses a chunk of shared memory to hold a fence object that lets Mesa know when buffers are idle without using the X connection. That shared memory segment is allocated by creating a temporary file using the O_TMPFILE flag:
fd = open("/dev/shm", O_TMPFILE | O_RDWR | O_CLOEXEC | O_EXCL, 0666);
This call cannot fail as /dev/shm is used by glibc for shared memory objects, and must therefore be world writable on any glibc system. However, with the Chromium sandbox enabled, it returns EPERM.

Running Without a Sandbox

Now that the bug appears to be in the sandboxing code, we can re-test with the GPU sandbox disabled:
$ chromium --ignore-gpu-blacklist --disable-gpu-sandbox
And, indeed, without the sandbox getting in the way of allocating a shared memory segment, Chromium appears happy to use the Intel driver with DRI3.

Final Thoughts

I looked briefly at the Chromium sandbox code. It looks like it needs to know intimate details of the OpenGL implementation for every possible driver it runs on; it seems to contain a fixed list of all possible files and modes that the driver will pass to open(2). That seems incredibly fragile to me, especially when used in a general Linux desktop environment. Minor changes in how the GL driver operates can easily cause the browser to stop working.

16 June 2014

Simon Kainz: on using crippled installers

Maybe someone else is using iDataPlex HW, so this might come in helpful... TL;DR: Dear Hardware Vendors! Please don't force people to use specific Linux distributions when a simple unzip command would do the same. At work we run (amongst others) an HPC system consisting of 120 IBM System X iDataPlex dx360 M4 nodes. Recently, after some MCE logs caused by DIMM errors, the IBM guy refused to send me replacement DIMMs if I didn't update the IMM firmware and torture test the machine for some days more. Well, ok... IBM uses the so-called UpdateXpress tool to update IMM and UEFI firmware. The tool is available for rhel4 to rhel7 (both 32 and 64 bit), SLES10 & SLES11 and Microsoft Windows. I tried all of the available UpdateXpress System Pack Installers, with basically the same result every time:
WARNING! This package doesn't appear to match your system.
         The following information was determined for your system:
           distribution = Unknown
           release = -1
           processor architecture = Unknown
I tried to set up a fake RHEL environment (a customized /usr/local/uname, a custom /etc/redhat-release, etc...), but after some more tries the installer (yes, it's still the installer I am trying to run) fails with some strange SEGFAULTs. After some closer inspection (well, the RHEL4 uxspi binary is about 109MB big!) I tried to unzip it, and voila:
~/firmware$ unzip ibm_utl_uxspi_9.60_rhel4_32-64.bin
Archive:  ibm_utl_uxspi_9.60_rhel4_32-64.bin
   creating: rhel6_64/image/
  inflating: rhel6_64/image/libpegasus.nsp.so  
  inflating: rhel6_64/image/libUxliteImmUsbLan.so  
  inflating: rhel6_64/image/eccConnect.properties  
  inflating: rhel6_64/image/libpegslp_client.so  
   creating: rhel6_64/image/esxi_assoc_templates/
  inflating: rhel6_64/image/esxi_assoc_templates/elxFC.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/elxCNA.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/ibmc.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/FPGA-S.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/degraded.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/brocadeFC.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/qlgcFC.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/broadcom.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/LSI.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/brocadeCNA.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/uefi.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/diags.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/qlgcCNA.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/VMwareESXi.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/ibmc2.uxt  
  inflating: rhel6_64/image/esxi_assoc_templates/FPGA.uxt  
  inflating: rhel6_64/image/libipmi_client.so  
  inflating: rhel6_64/image/UXLite_UPDATEID.dat  
  inflating: rhel6_64/image/libacpi.so  
  inflating: rhel6_64/image/libibmsp6_openipmi.so  
  ...
I managed to find an iflash64 binary, and after some trial and error, came up with the following workflow:
./iflash64 --host your_host_imm_IP --user your_username --password your_password --package imm2_1aoo56d-3.73.upd --reboot
IBM Command Line IMM Flash Update Utility v2.02.11
Licensed Materials - Property of IBM
(C) Copyright IBM Corp. 2009 - 2014  All Rights Reserved.
Connected to IMM at IP address x.x.x.x.
Update package firmware type: IMM2
Update package build level:   xxxxxx
Target's current build level: xxxxxx
Would you like to continue with the update? y/n: y
Node 0: The IMM is preparing to receive the update.
Node 0: Transferring image: 98%
Node 0: Transfer complete.
Node 0: Validating image.
Node 0: Updating firmware:  0%
Node 0: Updating firmware:  100%
Node 0: Update complete.
Performing activation of the firmware:
Connected to IMM at IP address 10.8.2.4 successfully.
..........
Waiting up to 360 seconds for the activation to complete:
..
..
And so on... Conclusio: Dear Hardware Vendors! Please don't force people to use specific Linux distributions when a simple unzip command would do the same.

22 April 2014

Martín Ferrari: DNSSEC, DANE, SSHFP, etc

While researching some security-related stuff for a post I am currently writing, I found some interesting bits here and there that I thought I should share, as they were new to me, and probably to many others.

DNSSEC

The first thing is DNSSEC. I knew about it, of course, but never bothered to dig much into it. While reading about many interesting applications of DNS for key distribution, and thinking of ways to use them, it is clear that DNSSEC is a precondition for any of that to work. In case you don't know about it, it is an extension to the DNS service to make it safer, for example to avoid the bad guys having you think that google.com points to sniffer.nsa.gov. Apart from these über-cool applications I was thinking about, avoiding DNS-based attacks becomes more and more relevant these days. And I think Debian and the rest of the Free Software world should work on making this available to all end-users as easily as possible. While adoption still looks pretty low, there is some good news. First, Google claims its public DNS supports DNSSEC. Of course, you need to trust Google's servers, and the path between your machine and them. But if your resolver supports DNSSEC, you can use their servers and validate the answers. On the other hand, I am not too sure about their implementation, as half of the time it would return a valid answer to a query for an invalid record (dig +dnssec sigfail.verteiltesysteme.net @8.8.8.8). Also, they have not published DNSSEC records for google.com, which seems crazy. Some packages included in Debian already take advantage of DNSSEC, if available (more on that later), but more importantly, there are a couple of DNSSEC-enabled recursive servers, including bind, unbound, and the more commonly-used dnsmasq (there is a wiki page summarising Debian's status). Sadly, the default configuration for dnsmasq does not enable DNSSEC, and most people will not use it, even if it is installed, because DHCP-provided servers are usually preferred. It seems to me that it would be wise to have a package that would install dnsmasq with DNSSEC enabled (a sketch of the configuration involved follows below), and make it the only valid resolver for the system. If you want to check whether your resolver is correctly validating DNSSEC, you can use this test web page. More good news is that many top-level domains already support DNSSEC, and in my case, Gandi.net has support in place to set it up. So I am going to look into enabling it for my own domain.
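For the curious, enabling validation in dnsmasq (2.69 or later) comes down to a few lines of configuration; a sketch (check the trust-anchors.conf shipped with dnsmasq for the current root key before copying this):

# /etc/dnsmasq.d/dnssec.conf -- sketch, not a shipped default
dnssec
dnssec-check-unsigned
# DS record of the root zone KSK (KSK-2010); verify against a trusted copy
trust-anchor=.,19036,8,2,49AAC11D7B6F6446702E54A1607371607A1A41855200FD2CE1CDDE32F24E8FB5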

SSHFP

One useful and simple advantage of using DNSSEC is that you can store information in the DNS and then trust it to be correct. One new DNS RR (resource record) that is useful in this context is the SSHFP RR, which allows the sysadmin of a host to publish the host's SSH key fingerprints in the DNS zone. The ssh client, when the VerifyHostKeyDNS option is enabled, will use that information to trust unknown hosts (an example follows the command output below). One downside to this is that you get the same message whether you set the option to ask or your resolver does not support DNSSEC, so it does not warn you about the extra risk. To help you create your DNS records, you can just run this command:
$ ssh-keygen -r brie.tincho.org
brie.tincho.org IN SSHFP 1 1 6ac93c63379828b5b75847bc37d8ab2b48983343
brie.tincho.org IN SSHFP 2 1 cf0d11515367e3aa7eeb37056688f11b53c8ef23
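On the client side, the option can be enabled per connection or in ssh_config; for example:

# ask: show what DNS says but still prompt; yes: trust a validated SSHFP outright
ssh -o VerifyHostKeyDNS=ask brie.tincho.org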

DANE, S/MIME and GPG

Recently, while at FOSDEM, I attended talks that mentioned DANE. This proposed IETF standard introduces a mechanism to use DNS as a secure key distribution system, which could completely override the CA oligopoly -- a very attractive proposition for many people. In short, it is very similar to the SSHFP mechanism, but it is not restricted to SSH host keys: it can be used to distribute public key information for any TLS-enabled service. So, instead of (or in addition to) having a CA sign your certificate, and relying on the chain of trust by means of having a local copy of all root CA certificates, you use the chain of trust embedded in DNSSEC to make sure that the DNS RRs you publish are valid. Then, the client application can trust the fingerprint published for the relevant service to verify that it is talking to the right server (an example lookup follows below). This is a very exciting development, and I hope it gets widespread adoption. It is already supported in Postfix, and there seems to be some work going on in Mozilla, as well as in Prosody, which is a great start. Another exciting development is the generalisation of DANE to other entities, like email addresses. There are two draft RFCs being worked on right now to deploy S/MIME and OpenPGP key material using DNSSEC. This could also completely change the way we manage the Web of Trust.
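To see what a DANE-aware client actually looks up, you can query the TLSA record for a service by hand (the domain here is just an example; _25._tcp is what an SMTP client such as Postfix checks):

dig +dnssec TLSA _25._tcp.mail.example.org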

21 April 2014

Matthew Garrett: Home entertainment implementations are pretty appalling

I picked up a Panasonic BDT-230 a couple of months ago. Then I discovered that even though it appeared fairly straightforward to make it DVD region free (I have a large pile of PAL region 2 DVDs), the US models refuse to play back PAL content. We live in an era of software-defined functionality. While Panasonic could have designed a separate hardware SKU with a hard block on PAL output, that would seem like unnecessary expense. So, playing with the firmware seemed like a reasonable start.

Panasonic provide a nice download site for firmware updates, so I grabbed the most recent and set to work. Binwalk found a squashfs filesystem, which was a good sign. Less good was the block at the end of the firmware with "RSA" written around it in large letters. The simple approach of hacking the firmware, building a new image and flashing it to the device didn't appear likely to work.

Which left dealing with the installed software. The BDT-230 is based on a Mediatek chipset, and like most (all?) Mediatek systems runs a large binary called "bdpprog" that spawns about eleventy billion threads and does pretty much everything. Running strings over that showed, well, rather a lot, but most promisingly included a reference to "/mnt/sda1/vudu/vudu.sh". Other references to /mnt/sda1 made it pretty clear that it was the mount point for USB mass storage. There were a couple of other constraints that had to be satisfied, but soon attempting to run Vudu was actually setting a blank root password and launching telnetd.

/acfg/config_file_global.txt was the next stop. This is a set of tokens and values with useful looking names like "IDX_GB_PTT_COUNTRYCODE". I tried changing the values, but unfortunately made a poor guess - on next reboot, the player had reset itself to DVD region 5, Blu Ray region C and was talking to me in Russian. More inconveniently, the Vudu icon had vanished and I couldn't launch a shell any more.

But where there's one obvious mechanism for running arbitrary code, there's probably another. /usr/local/bin/browser.sh contained the wonderful line:
export LD_PRELOAD=/mnt/sda1/bbb/libSegFault.so
, so then it was just a matter of building a library that hooked open() and launched inetd and dropping that into the right place, and then opening the browser.

This time I set the country code correctly, rebooted and now I can actually watch Monkey Dust again. Hurrah! But, at the same time, concerning. This software has been written without any concern for security, and it listens on the network by default. If it took me this little time to find two entirely independent ways to run arbitrary code on the device, it doesn't seem like a stretch to believe that there are probably other vulnerabilities that can be exploited with less need for physical access.

The depressing part of this is that there's no reason to believe that Panasonic are especially bad here - especially since a large number of vendors are shipping much the same Mediatek code, and so probably have similar (if not identical) issues. The future is made up of network-connected appliances that are using your electricity to mine somebody else's Dogecoin. Our nightmarish dystopia may be stranger than expected.


16 April 2014

Wouter Verhelst: Call for help for DVswitch maintenance

I took over "maintaining" DVswitch from Ben Hutchings a few years ago, since Ben realized he didn't have the time anymore to work on it well. After a number of years, I have to admit that I haven't done a very good job. Not because I didn't want to work on it, but mainly because I don't have enough time to fix DVswitch against the numerous moving targets that it uses; the APIs of libav and of liblivemedia change quickly enough that just making sure everything remains compilable and in working order is quite a job. DVswitch is used by many people; DebConf, FOSDEM, and the CCC are just a few examples, but I know of at least three more. Most of these (apart from DebConf and FOSDEM) maintain local patches which I've been wanting to merge into the upstream version of dvswitch. However, my time is limited, and over the past few years I've not been able to get dvswitch into a state where I confidently felt I could upload it into Debian unstable for a release. One step we took in order to get closer to that was to remove the liblivemedia dependency (which implied removing the support for RTSP sources). Unfortunately, the resulting situation wasn't good enough yet, since libav had changed its API enough that current versions of DVswitch compiled against current versions of libav will segfault if you try to do anything useful. I must admit to myself that I don't have the time and/or skill set to maintain DVswitch at an acceptable level all by myself. So, this is a call for help: if you're using DVswitch for your conference and want to continue doing so, please talk to us. The first things we'll need to do: See you there?

15 April 2014

Colin Watson: Porting GHC: A Tale of Two Architectures

We had some requests to get GHC (the Glasgow Haskell Compiler) up and running on two new Ubuntu architectures: arm64, added in 13.10, and ppc64el, added in 14.04. This has been something of a saga, and has involved rather more late-night hacking than is probably good for me.

Book the First: Recalled to a life of strange build systems

You might not know it from the sheer bulk of uploads I do sometimes, but I actually don't speak a word of Haskell and it's not very high up my list of things to learn. But I am a pretty experienced build engineer, and I enjoy porting things to new architectures: I'm firmly of the belief that breadth of architecture support is a good way to shake out certain categories of issues in code, that it's worth doing aggressively across an entire distribution, and that, even if you don't think you need something now, new requirements have a habit of coming along when you least expect them and you might as well be prepared in advance. Furthermore, it annoys me when we have excessive noise in our build failure and proposed-migration output and I often put bits and pieces of spare time into gardening miscellaneous problems there, and at one point there was a lot of Haskell stuff on the list and it got a bit annoying to have to keep sending patches rather than just fixing things myself, and ... well, I ended up as probably the only non-Haskell-programmer on the Debian Haskell team and found myself fixing problems there in my free time. Life is a bit weird sometimes. Bootstrapping packages on a new architecture is a bit of a black art that only a fairly small number of relatively bitter and twisted people know very much about. Doing it in Ubuntu is specifically painful because we've always forbidden direct binary uploads: all binaries have to come from a build daemon. Compilers in particular often tend to be written in the language they compile, and it's not uncommon for them to build-depend on themselves: that is, you need a previous version of the compiler to build the compiler, stretching back to the dawn of time where somebody put things together with a big magnet or something. So how do you get started on a new architecture? Well, what we do in this case is we construct a binary somehow (usually involving cross-compilation) and insert it as a build-dependency for a proper build in Launchpad. The ability to do this is restricted to a small group of Canonical employees, partly because it's very easy to make mistakes and partly because things like the classic "Reflections on Trusting Trust" are in the backs of our minds somewhere. We have an iron rule for our own sanity that the injected build-dependencies must themselves have been built from the unmodified source package in Ubuntu, although there can be source modifications further back in the chain. Fortunately, we don't need to do this very often, but it does mean that as somebody who can do it I feel an obligation to try and unblock other people where I can. As far as constructing those build-dependencies goes, sometimes we look for binaries built by other distributions (particularly Debian), and that's pretty straightforward. In this case, though, these two architectures are pretty new and the Debian ports are only just getting going, and as far as I can tell none of the other distributions with active arm64 or ppc64el ports (or trivial name variants) has got as far as porting GHC yet. Well, OK. This was somewhere around the Christmas holidays and I had some time.
Muggins here cracks his knuckles and decides to have a go at bootstrapping it from scratch. It can't be that hard, right? Not to mention that it was a blocker for over 600 entries on that build failure list I mentioned, which is definitely enough to make me sit up and take notice; we'd even had the odd customer request for it.

Several attempts later and I was starting to doubt my sanity, not least for trying in the first place. We ship GHC 7.6, and upgrading to 7.8 is not a project I'd like to tackle until the much more experienced Haskell folks in Debian have switched to it in unstable. The porting documentation for 7.6 has bitrotted more or less beyond usability, and the corresponding documentation for 7.8 really isn't backportable to 7.6. I tried building 7.8 for ppc64el anyway, picking that on the basis that we had quicker hardware for it and it didn't seem likely to be particularly more arduous than arm64 (ho ho), and I even got to the point of having a cross-built stage2 compiler (stage1, in the cross-building case, is a GHC binary that runs on your starting architecture and generates code for your target architecture) that I could copy over to a ppc64el box and try to use as the base for a fully-native build, but it segfaulted incomprehensibly just after spawning any child process. Compilers tend to do rather a lot, especially when they're built to use GCC to generate object code, so this was a pretty serious problem, and it resisted analysis. I poked at it for a while but didn't get anywhere, and I had other things to do so declared it a write-off and gave up.

Book the Second: The golden thread of progress

In March, another mailing list conversation prodded me into finding a blog entry by Karel Gardas on building GHC for arm64. This was enough to be worth another look, and indeed it turned out that (with some help from Karel in private mail) I was able to cross-build a compiler that actually worked and could be used to run a fully-native build that also worked. Of course this was 7.8, since as I mentioned cross-building 7.6 is unrealistically difficult unless you're considerably more of an expert on GHC's labyrinthine build system than I am. OK, no problem, right? Getting a GHC at all is the hard bit, and 7.8 must be at least as capable as 7.6, so it should be able to build 7.6 easily enough ...

Not so much. What I'd missed here was that compiler engineers generally only care very much about building the compiler with older versions of itself, and if the language in question has any kind of deprecation cycle then the compiler itself is likely to be behind on various things compared to more typical code, since it has to be buildable with older versions. This means that the removal of some deprecated interfaces from 7.8 posed a problem, as did some changes in certain primops that had gained an associated compatibility layer in 7.8 but nobody had gone back to put the corresponding compatibility layer into 7.6. GHC supports running Haskell code through the C preprocessor, and there's a __GLASGOW_HASKELL__ definition with the compiler's version number, so this was just a slog of tracking down changes in git and adding #ifdef-guarded code that coped with the newer compiler (remembering that stage1 will be built with 7.8 and stage2 with stage1, i.e. 7.6, from the same source tree).
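To give a flavour of what those guards look like, here is a minimal sketch of the CPP pattern, written in C syntax for brevity: the real shims live in GHC's Haskell and Cmm sources, and the primop names below are invented.

  /* A sketch of the version-gated compatibility shims described above.
   * The names are made up; only the CPP pattern matters.
   * __GLASGOW_HASKELL__ is 706 for GHC 7.6 and 708 for 7.8. */
  #if defined(__GLASGOW_HASKELL__) && __GLASGOW_HASKELL__ >= 708
  /* Building stage1 with the 7.8 bootstrap compiler: the old interface is
   * gone, so call its replacement. */
  #  define COMPAT_PRIMOP(x)  new_style_primop(x)
  #else
  /* Building stage2 with the freshly built 7.6 stage1: the old interface is
   * the only one that exists there. */
  #  define COMPAT_PRIMOP(x)  old_style_primop(x)
  #endif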
More inscrutably, GHC has its own packaging system called Cabal which is also used by the compiler build process to determine which subpackages to build and how to link them against each other, and some crucial subpackages weren't being built: it looked like it was stuck on picking versions from "stage0" (i.e. the initial compiler used as an input to the whole process) when it should have been building its own. Eventually I figured out that this was because GHC's use of its packaging system hadn't anticipated this case, and was selecting the higher version of the ghc package itself from stage0 rather than the version it was about to build for itself, and thus never actually tried to build most of the compiler. Editing ghc_stage1_DEPS in ghc/stage1/package-data.mk after its initial generation sorted this out.

One late night building round and round in circles for a while until I had something stable, and a Debian source upload to add basic support for the architecture name (and other changes which were a bit over the top in retrospect: I didn't need to touch the embedded copy of libffi, as we build with the system one), and I was able to feed this all into Launchpad and watch the builders munch away very satisfyingly at the Haskell library stack for a while. This was all interesting, and finally all that work was actually paying off in terms of getting to watch a slew of several hundred build failures vanish from arm64 (the final count was something like 640, I think).

The fly in the ointment was that ppc64el was still blocked, as the problem there wasn't building 7.6, it was getting a working 7.8. But now I really did have other much more urgent things to do, so I figured I just wouldn't get to this by release time and stuck it on the figurative shelf.

Book the Third: The track of a bug

Then, last Friday, I cleared out my urgent pile and thought I'd have another quick look. (I get a bit obsessive about things like this that smell of "interesting intellectual puzzle".) slyfox on the #ghc IRC channel gave me some general debugging advice and, particularly usefully, a reduced example program that I could use to debug just the process-spawning problem without having to wade through noise from running the rest of the compiler. I reproduced the same problem there, and then found that the program crashed earlier (in stg_ap_0_fast, part of the run-time system) if I compiled it with +RTS -Da -RTS. I nailed it down to a small enough region of assembly that I could see all of the assembly, the source code, and an intermediate representation or two from the compiler, and then started meditating on what makes ppc64el special.

You see, the vast majority of porting bugs come down to what I might call gross properties of the architecture. You have things like whether it's 32-bit or 64-bit, big-endian or little-endian, whether char is signed or unsigned, that sort of thing. There's a big table on the Debian wiki that handily summarises most of the important ones. Sometimes you have to deal with distribution-specific things like whether GL or GLES is used; often, especially for new variants of existing architectures, you have to cope with foolish configure scripts that think they can guess certain things from the architecture name and get it wrong (assuming that powerpc* means big-endian, for instance). We often have to update config.guess and config.sub, and on ppc64el we have the additional hassle of updating libtool macros too.
But I've done a lot of this stuff and I'd accounted for everything I could think of. ppc64el is actually a lot like amd64 in terms of many of these porting-relevant properties, and not even that far off arm64, which I'd just successfully ported GHC to, so I couldn't be dealing with anything particularly obvious. There was some hand-written assembly which certainly could have been problematic, but I'd carefully checked that this wasn't being used by the "unregisterised" (no specialised machine dependencies, so relatively easy to port but not well-optimised) build I was using. A failure around spawning processes suggested a problem with SIGCHLD handling, but I ruled that out by slowing down the first child process that it spawned and using strace to confirm that SIGSEGV was the first signal received. What on earth was the problem?

From some painstaking gdb work, one thing I eventually noticed was that stg_ap_0_fast's local stack appeared to be getting corrupted by a function call, specifically a call to the colourfully-named debugBelch. Now, when IBM's toolchain engineers were putting together ppc64el based on ppc64, they took the opportunity to fix a number of problems with their ABI: there's an OpenJDK bug with a handy list of references. One of the things I noticed there was that there were some stack allocation optimisations in the new ABI, which affected functions that don't call any vararg functions and don't call any functions that take enough parameters that some of them have to be passed on the stack rather than in registers. debugBelch takes varargs: hmm.

Now, the calling code isn't quite in C as such, but in a related dialect called "Cmm", a variant of C-- (yes, minus), that GHC uses to help bridge the gap between the functional world and its code generation, and which is compiled down to C by GHC. When importing C functions into Cmm, GHC generates prototypes for them, but it doesn't do enough parsing to work out the true prototype; instead, they all just get something like extern StgFunPtr f(void);. On most architectures you can get away with this, because the arguments get passed in the usual calling convention anyway and it all works out, but on ppc64el this means that the caller doesn't allocate enough stack space and the callee then tries to save its varargs onto the stack in an area that in fact belongs to the caller, and suddenly everything goes south. Things were starting to make sense.

Now, debugBelch is only used in optional debugging code; but runInteractiveProcess (the function associated with the initial round of failures) takes no fewer than twelve arguments, plenty to force some of them onto the stack. I poked around the GCC patch for this ABI change a bit and determined that it only optimised away the stack allocation if it had a full prototype for all the callees, so I guessed that changing those prototypes to extern StgFunPtr f(); might work: it's still technically wrong, not least because omitting the parameter list is an obsolescent feature in C11, but it's at least just omitting information about the parameter list rather than actively lying about it. I tweaked that and ran the cross-build from scratch again. Lo and behold, suddenly I had a working compiler, and I could go through the same build-7.6-using-7.8 procedure as with arm64, much more quickly this time now that I knew what I was doing. One upstream bug, one Debian upload, and several bootstrapping builds later, and GHC was up and running on another architecture in Launchpad. Success!
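In plain C terms, the mismatch looks roughly like the sketch below. This is only an illustration of the prototype problem described above, using an invented function name rather than GHC's actual declarations:

  #include <stdarg.h>
  #include <stdio.h>

  /* Stand-in for the real varargs callee (think debugBelch).  On ppc64el a
   * varargs callee saves its incoming register arguments into the parameter
   * save area that its *caller* is supposed to provide. */
  void example_belch(const char *fmt, ...)
  {
      va_list ap;
      va_start(ap, fmt);
      vfprintf(stderr, fmt, ap);
      va_end(ap);
  }

  /* What the generated Cmm import effectively declared instead:
   *
   *     extern StgFunPtr f(void);
   *
   * With that (false) "takes no arguments" prototype, an ELFv2 caller may
   * decide it needs no parameter save area at all, so the callee's register
   * spill lands in the caller's own frame and corrupts it.  Declaring the
   * import with an unspecified parameter list,
   *
   *     extern StgFunPtr f();
   *
   * is still not a proper prototype (and is obsolescent in C11), but it only
   * omits information rather than lying, so GCC keeps the parameter save
   * area and the call behaves. */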
Epilogue

There's still more to do. I gather there may be a Google Summer of Code project in Linaro to write proper native code generation for GHC on arm64: this would make things a good deal faster, but also enable GHCi (the interpreter) and Template Haskell, and thus clear quite a few more build failures. Since there's already native code generation for ppc64 in GHC, getting it going for ppc64el would probably only be a couple of days' work at this point. But these are niceties by comparison, and I'm more than happy with what I got working for 14.04.

The upshot of all of this is that I may be the first non-Haskell-programmer to ever port GHC to two entirely new architectures. I'm not sure if I gain much from that personally aside from a lot of lost sleep and being considered extremely strange. It has, however, been by far the most challenging set of packages I've ported, and a fascinating trip through some odd corners of build systems and undefined behaviour that I don't normally need to touch.

16 March 2014

Gunnar Wolf: Trying to get Blender to run in the CuBox-i (armhf): Help debugging Python!

As I posted some weeks ago, I have been playing with my CuBox-i4Pro, a gorgeous little ARM machine by SolidRun, built around an iMX6 system-on-a-chip. My first stabs at using it resulted in my previous post on how to get a base, almost-clean Debian distribution to run (Almost? Yes, the kernel requires some patches not yet accepted upstream, so I'm still running with a patched 3.0.35-8 kernel). After writing those step-by-step instructions, I followed them and built images ready to dd to an SD card and start running (available at my people.debian.org space).

Now, what to do with this little machine? My version is by no means a limited box: 4 ARM cores and 2GB of RAM make it quite decent. In my case, this little machine will most likely be a home storage server with little innovation. However, the little guy is a powerful little server at only 3W of consumption, and I wanted to test its capabilities to do some number crunching and aid some of my friends. The obvious candidate is building a Blender render farm. Right, the machines might be quite underpowered, but they are cheap (and look gorgeous!), so at least it's worth playing a bit!

Just as a data point, running on an old hard disk (and not on my very slow SD card), the little machine was able to compile the Blender sources into a Debian package in 89m13.537s, that is, 5353 seconds. According to the Debian build logs (yes, for a different version; I tried with the version in Wheezy and in a clean Wheezy system), the time it took to build on some other architectures' build daemons was 1886s on i386, 1098s on PowerPC, 2003s on AMD64, 11513s on MIPS and 27721s on armhf. That means my little machine is quite a bit slower than desktop systems, but not unbearably so.

But sadly, I have hit a wall, and have been unable to make any further progress: Blender segfaults at startup under the Debian armhf architecture. I have submitted bug report #739194 about this, but have got no replies to it yet. I did get great help from my friends in the OFTC #debian-arm channel, but they could only help up to a given point. It seems the problem lies in the Python interpreter on armhf, not in Blender itself... but I cannot get much further either. I'm sending this as a blog post to try to get more eyeballs on my problem. How selfish, right? :-)

So, slightly going over the bug report, blender just dies at startup:
  $ blender -b -noaudio
  Segmentation fault
After being told that strace is of little help when debugging this kind of issue, I went in with gdb. A full backtrace pointed to what feels like the right error point:
  (gdb) bt full
  #0  0x2acd8cce in PyErr_SetObject ()
     from /usr/lib/arm-linux-gnueabihf/libpython3.3m.so.1.0
          No symbol table info available.
  #1  0x2acd8c9a in PyErr_Format ()
     from /usr/lib/arm-linux-gnueabihf/libpython3.3m.so.1.0
          No symbol table info available.
  #2  0x2ac8262c in PyType_Ready ()
     from /usr/lib/arm-linux-gnueabihf/libpython3.3m.so.1.0
          No symbol table info available.
  #3  0x2ac55052 in _PyExc_Init ()
     from /usr/lib/arm-linux-gnueabihf/libpython3.3m.so.1.0
          No symbol table info available.
  #4  0x2ace95e2 in _Py_InitializeEx_Private ()
     from /usr/lib/arm-linux-gnueabihf/libpython3.3m.so.1.0
          No symbol table info available.
  #5  0x00697898 in BPY_python_start (argc=3, argv=0x7efff8d4)
      at /build/blender-dPxPUD/blender-2.69/source/blender/python/intern/bpy_interface.c:274
          py_tstate = 0x0
          py_path_bundle = 0x0
          program_path_wchar = L"/usr/bin/blender", '\000' <repeats 1007 times>
  #6  0x0044a1ca in WM_init (C=C@entry=0x1d83220, argc=argc@entry=3,
      argv=argv@entry=0x7efff8d4)
      at /build/blender-dPxPUD/blender-2.69/source/blender/windowmanager/intern/wm_init_exit.c:176
          No locals.
  #7  0x0042e4de in main (argc=3, argv=0x7efff8d4)
      at /build/blender-dPxPUD/blender-2.69/source/creator/creator.c:1597
          C = 0x1d83220
          syshandle = 0x1d8a338
          ba = 0x1d8a840
  (gdb)
I'm not pasting the full bug history here (go to the bug report for the full information!), but it does point me to this being a problem in Python-land: it points to something not found at line 59 of Python/errors.c. And what I understand from that line is that some kind of unknown exception is thrown, and the Python interpreter does not know what to do with it. The check done at line 59 is the if (exception != NULL && ...):
  void
  PyErr_SetObject(PyObject *exception, PyObject *value)
  {
      PyThreadState *tstate = PyThreadState_GET();
      PyObject *exc_value;
      PyObject *tb = NULL;

      if (exception != NULL &&
          !PyExceptionClass_Check(exception)) {
          PyErr_Format(PyExc_SystemError,
                       "exception %R not a BaseException subclass",
                       exception);
          return;
      }
      /* ... */
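For reference, PyExceptionClass_Check is roughly the following macro (from CPython 3.3's Include/pyerrors.h, quoted from memory, so treat it as a sketch); that branch therefore fires whenever the object passed as exception is not a type object flagged as a BaseException subclass:

  #define PyExceptionClass_Check(x)                                    \
      (PyType_Check((x)) &&                                            \
       PyType_FastSubclass((PyTypeObject *)(x),                        \
                           Py_TPFLAGS_BASE_EXC_SUBCLASS))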
So... Dear lazyweb: Any pointers on where to go on from here? kthxbye.

15 December 2013

Hideki Yamane: my Desktop Environment broken - any help?


Recently, my GNOME environment suddenly broke. Starting GNOME Shell causes a segfault and it dies, so I cannot use GDM3 at all.

henrich@hp:~ $ grep gnome-shell /var/log/syslog
Dec 15 10:51:12 hp kernel: [34944.969581] gnome-shell[10054]: segfault at 84 ip 00007f6c6186a7b9 sp 00007fff59f1dbc0 error 4 in libcogl.so.12.1.1[7f6c61821000+97000]
Dec 15 10:51:12 hp gnome-session[9565]: WARNING: Application 'gnome-shell.desktop' killed by signal 11
Dec 15 10:51:14 hp kernel: [34946.429046] gnome-shell[10202]: segfault at 84 ip 00007ffa2ae877b9 sp 00007fff2bc5c7d0 error 4 in libcogl.so.12.1.1[7ffa2ae3e000+97000]
Dec 15 10:51:14 hp gnome-session[9565]: WARNING: Application 'gnome-shell.desktop' killed by signal 11
Dec 15 10:51:14 hp gnome-session[9565]: WARNING: App 'gnome-shell.desktop' respawning too quickly
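For what it's worth, the kernel lines above can be decoded a little: on x86-64, "error 4" means a user-mode read of an unmapped address, and a fault address as small as 0x84 usually points at a structure member being read through a NULL pointer. A minimal C illustration of that pattern (an invented layout, not cogl's actual code):

  #include <stddef.h>

  /* Invented structure: a member that happens to live 0x84 bytes in, so
   * offsetof(struct example, member) == 0x84. */
  struct example {
      char padding[0x84];
      int  member;
  };

  int read_member(const struct example *p)
  {
      /* If p is NULL, this reads address 0x84 and the kernel logs
       * "segfault at 84 ... error 4" (a user-mode read fault). */
      return p->member;
  }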
So I switched to KDM and installed xfce4. However, I faced another problem: it cannot show any icons (on the desktop or in the panel).


Odd Chromium window (why red?).


Also, eog says "Could not load image" - "Unrecognized image file format". Well, why? It's just a .png file. (Screenshot taken under KDE4.)
Shotwell segfaults, too :-( (as do some other GTK applications).

I cannot pinpoint the broken component. Dear lazyweb, could you tell me how to fix this situation, please?

